Method and System for Assessment of Regulatory Variants in a Genome

ABSTRACT

The present invention provides methods embodied in a system that can be applied to genetic information comprising an individual genome to assess the regulatory impact of specific genetic variants and their possible impact on biological function or disease pathology.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application No. 61/526,242 filed Aug. 22, 2012, which is hereby incorporated by reference in its entirety for all purposes.

This application claims priority to U.S. Provisional Application No. 61/526,095 filed Aug. 22, 2012, which is hereby incorporated by reference in its entirety for all purposes.

GOVERNMENT RIGHTS

This invention was made with Government support under contracts HG000237 and HG004558 awarded by the National Institutes of Health. The Government has certain rights in this invention.

BACKGROUND OF THE INVENTION

Genome-wide association studies have discovered many genetic loci associated with disease, but the molecular basis of these associations is often unresolved. Genome-wide regulatory and gene expression profiles measured across individuals and diseases reflect downstream effects of genetic variation, and may allow for functional annotation of disease-associated loci.

Complete genome sequences of individual patients will soon become integrated as part of routine clinical care. There exists a need for interpreting the clinical significance of novel genetic variants presenting in a patient's personal genome, which are known to be associated with diverse clinical disorders. Current tools and approaches for clinical assessment of genetic variation do not explicitly consider gene regulatory information and are typically focused on specific gene coding regions.

SUMMARY OF THE INVENTION

Embodiments of the present invention enable genome-wide systematic evaluation of potentially clinically relevant genetic variation in a personal genome. Among other things, the present invention provides methods embodied in a system that can be applied to genetic information comprising an individual genome to assess the regulatory impact of specific genetic variants and their possible impact on biological function or disease pathology.

An embodiment of the invention is comprised of databases and algorithms embodied in software, where the databases contain information providing genome-wide quantitative and genetic profiles of transcription factor binding measured across a multitude of individual human genomes, information of genetic variants associated with disease conditions, information on DNA motifs associated with transcription factor binding, as well as molecular profiles of disease pathology.

Embodiments of the present invention use genome-wide quantitative gene regulatory information to assess total genetic variation presenting in an individual genome. The present invention provides the ability to infer transcription factor binding events from genotypes. Also, the present invention provides the ability to associate individual variation in gene regulatory regions with biological function and disease pathology.

Applications of the present invention include clinical assessment of personal genomes, clinical assessment of cancer genomes, interpretation of genetic disease associations, and discovery of regulatory DNA biomarkers. In other embodiments, additional types of gene regulatory information can be added.

These and other embodiments can be more fully appreciated upon an understanding of the detailed description of the invention as disclosed below in conjunction with the attached figures.

BRIEF DESCRIPTION OF THE DRAWINGS

The following drawings will be used to more fully describe embodiments of the present invention.

FIGS. 1A and 1B are block diagrams of computer systems on which embodiments of the present invention can be practiced.

FIG. 2 is a block diagram of a method according to an embodiment of the invention.

FIG. 3 is a block diagram of a method according to an embodiment of the invention.

FIG. 4 is a graph illustrating the variability of transcription factor binding in the analysis of regulatory variation.

FIG. 5 is an illustration of the manner in which a single SNP is associated with NFkB binding.

FIG. 6 is an illustration of the manner in which the EBF1 motif affects NFkB binding.

FIG. 7 is a graph demonstrating the manner in which disease-associated SNPs in NFkB binding regions are more pleiotropic.

FIG. 8 is a graph illustrating the effects of transcription factor binding and disease SNPs.

FIG. 9 is a graph illustrating the manner in which SNPs associated with inflammatory and autoimmune diseases were overrepresented in NFκB binding regions.

DETAILED DESCRIPTION

Among other things, the present invention relates to methods, techniques, and algorithms that are intended to be implemented in digital computer system 100 such as generally shown in FIG. 1A. Such a digital computer or embedded device is well-known in the art and may include the following.

Computer system 100 may include at least one central processing unit 102 but may include many processors or processing cores. Computer system 100 may further include memory 104 in different forms such as RAM, ROM, hard disk, optical drives, and removable drives that may further include drive controllers and other hardware. Auxiliary storage 112 may also be included that can be similar to memory 104 but may be more remotely incorporated such as in a distributed computer system with distributed memory capabilities.

Computer system 100 may further include at least one output device 108 such as a display unit, video hardware, or other peripherals (e.g., printer). At least one input device 106 may also be included in computer system 100 that may include a pointing device (e.g., mouse), a text input device (e.g., keyboard), or touch screen.

Communications interfaces 114 also form an important aspect of computer system 100, especially where computer system 100 is deployed as a distributed computer system. Communications interfaces 114 may include LAN network adapters, WAN network adapters, wireless interfaces, Bluetooth interfaces, modems, and other networking interfaces as currently available and as may be developed in the future.

Computer system 100 may further include other components 116 that may be generally available components as well as specially developed components for implementation of the present invention. Importantly, computer system 100 incorporates various data buses 118 that are intended to allow for communication of the various components of computer system 100. Data buses 118 include, for example, input/output buses and bus controllers.

Indeed, the present invention is not limited to computer system 100 as known at the time of the invention. Instead, the present invention is intended to be deployed in future computer systems with more advanced technology that can make use of all aspects of the present invention. It is expected that computer technology will continue to advance, but one of ordinary skill in the art will be able to take the present disclosure and implement the described teachings on the more advanced computers or other digital devices such as mobile telephones or “smart” televisions as they become available.

Moreover, the present invention may be implemented on one or more distributed computers. Still further, the present invention may be implemented in various types of software languages including C, C++, and others. Also, one of ordinary skill in the art is familiar with compiling software source code into executable software that may be stored in various forms and in various media (e.g., magnetic, optical, solid state, etc.). One of ordinary skill in the art is familiar with the use of computers and software languages and, with an understanding of the present disclosure, will be able to implement the present teachings for use on a wide variety of computers.

The present disclosure provides a detailed explanation of the present invention with detailed explanations that allow one of ordinary skill in the art to implement the present invention into a computerized method. Certain of these and other details are not included in the present disclosure so as not to detract from the teachings presented herein but it is understood that one of ordinary skill in the art would be familiar with such details.

In an embodiment of the invention as shown in FIG. 1B, a computer server that implements certain of the methods of the invention is remotely situated from a user. Computer server 122 is communicatively coupled so as to receive information from a user; likewise, computer server 122 is communicatively coupled so as to send information to a user. In an embodiment of the invention, the user uses user computing device 124 so as to access computer server 122 via network 126. Network 126 can be the internet, a local network, a private network, a public network, or any other network as may be appropriate to implement the invention as described herein.

User computing device 124 can be implemented in various forms such as desktop computer 128, laptop computer 130, smart phone 132, or tablet device 134. Other devices that may be developed and are capable of the computing actions described herein are also appropriate for use in conjunction with the present invention.

In the present disclosure, computing and other activities will be described as being conducted on either computer server 122 or user computing device 124. It should be understood, however, that many, if not all, of such activities may be reassigned from one device to the other while keeping within the present teachings. For example, for certain steps or computations that may be described as being performed on computer server 122, a different embodiment may have such computations performed on user computing device 124.

In an embodiment of the invention, computer server 122 is implemented as a web server on which Apache HTTP web server software is run. Computer server 122 can also be implemented in other manners such as an Oracle web server (known as Oracle iPlanet Web Server). In an embodiment computer server 122 is a UNIX-based machine but can also be implemented in other forms such as a Windows-based machine. Configured as a web server, computer server 122 is configured to serve web pages over network 126 such as the internet.

In an embodiment, user computing device 124 is configured so as to run web browser software. For example, where user computing device 124 is implemented as desktop computer 128 or laptop computer 130, currently available web browser software includes Internet Explorer, Firefox, and Chrome. Other browser software is available for different applications of user computing device 124. Still other software is expected to be developed in the future that is able to execute certain steps of the present invention.

In an embodiment, user computing device 124, through the use of appropriate software, queries computer server 122. In response to such a query, computer server 122 provides information so as to display certain graphics and text on user computing device 124. In an embodiment, the information provided by computer server 122 is in the form of HTML that can be interpreted by and properly displayed on user computing device 124. Computer server 122 may provide other information that can be interpreted on user computing device 124.

It has been found that transcription factor binding is significant in the analysis of regulatory variation. For example, as shown in FIG. 4, whereas individuals vary very little at approximately 1/1300 bp and vary at approximately 1/100 in promoter regions, transcription factor binding is much more variable. As shown in FIG. 4, promoter sequences 406, coding region nucleotides 404, and amino acid sequences 408 vary very little. Note, however, that transcription factor binding 402 has been found to vary in approximately 7.5% of NFkB binding sites. For purposes of personal genome sequencing, it has been found that 7.5% of SNPs are in transcription factor binding sites and 21% of SNPs are in DNase1 HS sites. Embodiments of the present invention make use of these facts for purposes of improved functional and clinical analysis. For example, a single SNP is associated with NFkB binding as shown in FIG. 5.

In a certain sense, it is important to consider how binding of one factor can affect the binding of another. For example, as shown in FIG. 5, Stat1 motif 502 affects Stat1 504 binding. But it is also important to note that Stat1 504 binding affects NFkB 508 binding, for example, despite the conservation of the NFkB motif 510. More particularly, for example, it has been found that the EBF1 motif affects NFkB binding as shown in FIG. 6.

A systematic approach is presented herein to combine disease association, transcription factor binding, and gene expression data to assess the functional consequences of variants associated with hundreds of human diseases. In an analysis of genome-wide binding profiles of NFκB, it was found that disease-associated SNPs are enriched in NFκB binding regions overall, and specifically for inflammatory-mediated diseases, such as asthma, rheumatoid arthritis, and coronary artery disease. Using genome-wide binding variation information for eight fully sequenced individuals, it was found that regions of NFκB binding correlated with disease-associated variants in an allele-specific manner (see pipeline method of FIG. 3 to be discussed below). Furthermore, it was found that this binding variation is often correlated with expression of nearby genes, which are also found to have altered expression in independent profiling of the variant-associated disease condition. This systematic approach closes the loop in biological context-free association studies and assigns putative function to many disease-associated SNPs. In other embodiments of the invention, these predictions can be validated for atherosclerosis, asthma, and/or rheumatoid arthritis. It should, therefore, be noted that although certain particular embodiments will be discussed herein, such descriptions are illustrative and do not limit the scope of the present invention.

The association between genotype and phenotype is a fundamental problem in biology and translational medicine. Genome-wide association studies (GWASs) have identified many genetic variants associated with diseases [see Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 106, 9362-9367 (2009); note that these and other references cited herein are incorporated by reference for all purposes], but such approaches rely on “tag” single nucleotide polymorphisms (SNPs) found on DNA microarrays. While these SNPs may lie in or near gene regions, their specific influences on the biology of disease are not necessarily determined in typical GWASs [Green, E. D., Guyer, M. S. & National Human Genome Research Institute. Charting a course for genomic medicine from base pairs to bedside. Nature 470, 204-213 (2011)]. Furthermore, disease-associated SNPs that are found outside of genic regions are often not further investigated because they are of unknown function.

Systems biology can provide an approach to bridge the gap between genotype and phenotype. For example, human variation in transcription factor (TF) binding has been correlated with polymorphisms in motifs for NFκB and PolII in ten individuals [Kasowski, M. et al. Variation in transcription factor binding among humans. Science 328, 232-235 (2010); Karczewski, K. J. et al. Discovering Cooperative Transcription Factor Associations using Binding Variation Information and the ALPHABIT Pipeline. 1-25 (2011)], and regulatory features across dozens of cell lines have been mapped extensively by the ENCODE project [Birney, E. et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799-816 (2007)].

It is, therefore, expected that polymorphisms that affect transcription factor binding can have a significant influence on disease because the differences in TF binding (that lead to downstream differences in expression) may be the true underlying cause of the disease association of the SNPs. These functional biology-rich sources of data can, therefore, be leveraged to suggest putative function for previously unannotated disease-associated SNPs.

In the present disclosure, the role of transcription factor binding sites in disease is described. As a non-limiting case study, genome-wide enrichments are explored for disease SNPs in NFκB (p65) binding regions to predict genotype-specific binding events associated with disease.

Shown in FIG. 2 is a block diagram of a method according to an embodiment of the present invention. As shown in step 202, information is received regarding genome-wide quantitative and genetic profiles of transcription factor binding measured across a multitude of individual human genomes. At step 204, information is received regarding genetic variants associated with disease conditions. Information regarding DNA motifs associated with transcription factor binding is received at step 206. At step 208, molecular profiles of disease pathologies are received. In an embodiment of the present invention, the information of steps 202 through 208 is contained in databases. In an embodiment, certain of the information of steps 202 through 208 is automatically generated or manually curated as further disclosed in copending application Ser. No. 13/592,292, entitled “Method and System, for the Use of Biomarkers for Regulatory Dysfunction in Disease,” which is herein incorporated by reference for all purposes. In an embodiment of the invention, the databases are maintained locally on a computer system on which the methods of the present invention are processed. In another embodiment of the invention, the databases are maintained remotely from a computer on which certain steps of the present invention are performed.

As shown in step 210 of FIG. 2, the regulatory impact of specific genetic variants is assessed using the received information. Examples of this analysis are provided below; however, those of ordinary skill in the art will understand that many other types of analyses are possible without deviating from the teachings of the present invention. Further processing can be performed such as shown in step 212 where the impact of genetic variants on biological function is assessed. Also, an embodiment of the present invention further performs step 214 where the impact of genetic variants on disease pathology is assessed. One of ordinary skill in the art will understand, however, that many other variations of the present invention are possible.
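By way of a non-limiting illustration, the flow of FIG. 2 might be sketched in Python as follows. The container class, its field names, and the simple rule used for step 210 (a variant is reported when it is both disease-associated and located in a known motif) are hypothetical assumptions made for this sketch and are not taken from the disclosed implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ReceivedInformation:
    """Hypothetical container for the information of steps 202 through 208."""
    binding_profiles: dict = field(default_factory=dict)  # step 202: region -> {individual: signal}
    disease_variants: dict = field(default_factory=dict)  # step 204: variant id -> set of diseases
    motif_sites: dict = field(default_factory=dict)       # step 206: variant id -> motif name
    disease_profiles: dict = field(default_factory=dict)  # step 208: disease -> set of genes


def assess_regulatory_impact(variant_ids, info):
    """Step 210 (illustrative rule only): report variants of the individual
    that are disease-associated and fall in a transcription factor motif."""
    report = {}
    for vid in variant_ids:
        if vid in info.disease_variants and vid in info.motif_sites:
            report[vid] = {
                "motif": info.motif_sites[vid],
                "diseases": sorted(info.disease_variants[vid]),
            }
    return report
```

In such a sketch, the dictionaries could equally be backed by local or remote databases, consistent with the embodiments described above.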

NFκB Binding Regions Are Enriched for Disease Associated SNPs

In a particular embodiment of the present invention, a compendium of disease SNPs [see Ashley, E. A. et al. Clinical assessment incorporating a personal genome. Lancet 375, 1525-1535 (2010); Chen, R., Davydov, E. V., Sirota, M. & Butte, A. J. Non-synonymous and synonymous coding SNPs show similar likelihood and effect size of human disease association. PLoS ONE 5, e13574 (2010)] was intersected with a set of 15,522 NFκB binding regions found in lymphoblastoid cell lines from ten individuals [Kasowski, M. et al. Variation in transcription factor binding among humans. Science 328, 232-235 (2010)]. It was found that established disease-associated SNPs were overabundant in regions bound by NFκB (χ²=292.9; p=1.1e-65; Fisher's OR=2.95). These associations are not biased by the platforms used for disease association discovery, as NFκB regions are underrepresented on Affymetrix 6.0 and 500K arrays (Fisher's OR=0.8 and 0.82, respectively), and only slightly overrepresented on Illumina 550K and 1M (Fisher's OR= and 1.42, respectively), which represented a smaller portion of this analysis. Additionally, binding sites of a known interacting factor, Stat1, were also highly enriched for disease SNPs; this enrichment was not present in promoter regions, as defined by PolII binding, as shown in FIG. 8.
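As a minimal sketch of this kind of enrichment calculation, the following Python functions compute a sample odds ratio and a Pearson chi-squared statistic from a 2x2 table of SNP counts. The function names and the counts are hypothetical and do not reproduce the counts underlying the statistics reported above. Note also that the sample odds ratio shown here may differ slightly from the conditional estimate reported by some Fisher's exact test routines.

```python
def odds_ratio(a, b, c, d):
    """Sample odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def chi_squared(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for a 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical counts, for illustration only.
# Rows: disease-associated SNP (yes / no).
# Columns: inside / outside an NFkB binding region.
disease_in, disease_out = 30, 970
other_in, other_out = 100, 9900

print(odds_ratio(disease_in, disease_out, other_in, other_out))
print(chi_squared(disease_in, disease_out, other_in, other_out))
```

An odds ratio above 1 in such a table would indicate that disease-associated SNPs fall inside binding regions more often than other SNPs do.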

As shown in FIG. 9, SNPs associated with inflammatory and autoimmune diseases, including Rheumatoid Arthritis, Asthma, and Systemic Lupus Erythematosus, were highly overrepresented in NFκB binding regions. These SNPs were enriched compared to all SNPs as well as the subset of disease-associated SNPs.

Disease-associated SNPs in NFκB binding regions are more pleiotropic (i.e., typically associated with more diseases) than the collection of known disease-associated SNPs (1.33 vs. 1.15; t-test p-value=8.7e-4; Mann Whitney U-test p-value=2.7e-7). For example, as shown in FIG. 7, it was found that, on average, 1.15 diseases are associated with each disease-associated SNP overall, whereas the average disease-associated SNP in NFκB binding regions was associated with 1.33 diseases. In FIG. 7, note that areas 702 correspond to NFκB SNPs and areas 704 correspond to all disease SNPs.
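A comparison of this kind can be sketched in Python as follows. The function names are hypothetical, and the use of Welch's unequal-variance form of the t statistic is an assumption of this sketch, since the particular form of the t-test is not specified above.

```python
from math import sqrt
from statistics import mean, variance


def diseases_per_snp(associations):
    """Number of distinct diseases for each SNP in {snp_id: iterable of diseases}."""
    return [len(set(diseases)) for diseases in associations.values()]


def welch_t(x, y):
    """Welch's two-sample t statistic (unequal variances assumed)."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))
```

Here `diseases_per_snp` would be applied once to the SNPs in binding regions and once to all disease-associated SNPs, and the two resulting lists would be passed to `welch_t`.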

Disease Associated SNPs Are Found In More Biologically Relevant Binding Regions

NFκB binding regions that harbor disease-associated SNPs are more strongly bound by NFκB, as determined by ChIP-Seq binding intensity, compared to the background of all NFκB binding regions. Additionally, these binding regions are less variable, indicating the potential for evolutionary constraint on these regions.

SNPs in NFκB Binding Regions Suggest a Mechanism For The Biology of Disease

In a systematic effort to assign functional annotation to disease-associated SNPs, a pipeline as shown in FIG. 3 was developed to discover putative SNPs that may be associated with an effect on NFκB binding. As shown in FIG. 3, at step 302 a compendium of 24K disease-associated SNPs was received from a database. At step 304, 15,522 genome-wide transcription factor binding profiles were received. At step 306, the NFκB binding profiles for eight individuals were considered. Genome-wide expression profiling was then performed (e.g., RNA-Seq). Finally, at step 310, the human genomic disease expression information was collected in a database.
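The intersection of disease-associated SNPs with binding regions that underlies this pipeline can be sketched in Python as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the function names and tuple layouts are hypothetical, coordinates are assumed to be half-open, and the regions on each chromosome are assumed not to overlap.

```python
from bisect import bisect_right
from collections import defaultdict


def index_regions(regions):
    """Index (chrom, start, end) regions by chromosome, sorted by start.

    Assumes half-open [start, end) coordinates and no overlaps per chromosome.
    """
    grouped = defaultdict(list)
    for chrom, start, end in regions:
        grouped[chrom].append((start, end))
    index = {}
    for chrom, intervals in grouped.items():
        intervals.sort()
        index[chrom] = ([s for s, _ in intervals], [e for _, e in intervals])
    return index


def snps_in_regions(snps, index):
    """Return the ids of SNPs (id, chrom, pos) that lie inside an indexed region."""
    hits = []
    for snp_id, chrom, pos in snps:
        starts, ends = index.get(chrom, ([], []))
        i = bisect_right(starts, pos) - 1  # last region starting at or before pos
        if i >= 0 and pos < ends[i]:
            hits.append(snp_id)
    return hits
```

The binary search keeps each lookup logarithmic in the number of regions per chromosome, which may matter when tens of thousands of SNPs are intersected with thousands of binding regions.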

Using genotype and NFκB binding information from eight individuals [Kasowski, M. et al. Variation in transcription factor binding among humans. Science 328, 232-235 (2010)], a preliminary, lower-power analysis was performed to identify candidate SNPs. In an assessment of SNPs in NFκB binding regions in linkage disequilibrium (R2>0.5) with a disease-associated SNP, SNPs associated with NFκB binding were found by an ANOVA. For instance, rs6135095, a SNP previously reported to be associated with atherosclerosis, shows significant association between genotype and NFκB binding in the 8 cell lines queried.
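A test of association between genotype and binding signal of the kind described above can be sketched with a one-way ANOVA F statistic in Python. The function names are hypothetical, and the sketch computes only the F statistic; obtaining a p-value would additionally require the F distribution, which is omitted here.

```python
from collections import defaultdict
from statistics import mean


def group_by_genotype(genotypes, signals):
    """Group per-individual binding signals by genotype label."""
    groups = defaultdict(list)
    for genotype, signal in zip(genotypes, signals):
        groups[genotype].append(signal)
    return groups


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of groups of measurements.

    Requires at least two non-empty groups and some within-group variation.
    """
    groups = [g for g in groups if g]  # ignore empty genotype classes
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean([v for g in groups for v in g])
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

With only eight individuals, some genotype classes may contain very few samples, which is consistent with the characterization of this analysis as preliminary and lower-power.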

These variants associated with NFκB binding were linked with downstream expression effects of nearby genes. Considering all genes within 200 kb to be potential targets, disease-associated SNPs were found to be associated with changes in NFκB binding which were correlated with expression of nearby genes.
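The selection of candidate target genes and the correlation of binding with expression might be sketched in Python as follows. The function names and the tuple layout are hypothetical. The sketch also assumes that distance is measured from the SNP to the nearest gene boundary, which is not specified above; another convention, such as distance to the transcription start site, could be substituted.

```python
from math import sqrt

WINDOW_BP = 200_000  # "within 200 kb", as stated in the text


def nearby_genes(snp_pos, genes, window=WINDOW_BP):
    """Names of genes (name, start, end), on the SNP's chromosome, that lie
    within `window` bp of the SNP (distance to nearest gene boundary)."""
    selected = []
    for name, start, end in genes:
        if start <= snp_pos <= end:
            distance = 0
        else:
            distance = min(abs(snp_pos - start), abs(snp_pos - end))
        if distance <= window:
            selected.append(name)
    return selected


def pearson_r(x, y):
    """Pearson correlation between paired measurements, e.g., binding signal
    and expression level across the same individuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)
```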

In an independent validation experiment, aortic tissues from 10 individuals were genotyped and certain of them were found with rs6135095 and TT. This SNP was found to be associated with binding of NFκB (by ChIP-qPCR) as well as expression of nearby genes (SIRPG, etc.).

ENCODE/HS Sites

PolII binding regions were not overrepresented for disease SNPs. Therefore, an enrichment for disease-associated SNPs was computed for various factors in several cell lines and it was found that these SNPs were overrepresented in certain of those factors.

Data Sources

Data on disease-SNP associations (p<0.01) were used as in [Ashley, E. A. et al. Clinical assessment incorporating a personal genome. Lancet 375, 1525-1535 (2010); Chen, R., Davydov, E. V., Sirota, M. & Butte, A. J. Non-synonymous and synonymous coding SNPs show similar likelihood and effect size of human disease association. PLoS ONE 5, e13574 (2010)]. ChIP-Seq data on eight cell lines with individual genome sequences were obtained from [Kasowski, M. et al. Variation in transcription factor binding among humans. Science 328, 232-235 (2010)]. All analyses were performed using dbSNP release 132 and hg19 coordinates.

ASB/ASE

ChIP-Seq reads were mapped to hg19 assembly of the human genome using BWA. PCR duplicates were filtered using Picard tools. Variant calling files were downloaded from 1000 Genomes and converted to hg19 coordinates with VCF tools. Allele-specific binding (ASB) was determined on a per-heterozygote per-individual basis for the ten individuals. Reads were filtered to be above MAQ 30 mapping quality. For each individual, a binomial probability of success was determined based on the probability that a reference allele maps to the genome compared to a non-reference. Allele-specific expression (ASE) was similarly determined using reads from the transcriptome of each individual.
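For a single heterozygous site in a single individual, an allele-specific test of this kind can be sketched in Python as an exact two-sided binomial test. The function names are hypothetical, and the parameter `p_ref` stands for the expected probability that a read carries the reference allele; its default of 0.5 is an assumption of this sketch and would in practice be replaced by the per-individual probability described above.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials (0 < p < 1),
    computed in log space to remain stable at high read depth."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)


def allele_specific_p(ref_reads, alt_reads, p_ref=0.5):
    """Two-sided exact binomial p-value for allelic imbalance: the total
    probability of all outcomes no more likely than the observed one."""
    n = ref_reads + alt_reads
    observed = binom_pmf(ref_reads, n, p_ref)
    tail = sum(
        prob
        for prob in (binom_pmf(k, n, p_ref) for k in range(n + 1))
        if prob <= observed * (1 + 1e-7)  # tolerance for floating-point ties
    )
    return min(1.0, tail)
```

A small p-value at a heterozygous site would suggest that reads favor one allele more than the expected mapping behavior explains; the same function could be applied to transcriptome reads for allele-specific expression.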

Statistical Analysis

Overall associations between NFκB binding regions and disease-associated SNPs were ascertained by chi-squared and Fisher's exact tests. Associations between individual SNPs and binding strengths were tested by two sample t-test (with two genotypes grouped) or ANOVA for all 3 genotypes. All statistical analysis methods were performed using R statistical software (2.12.1).

Using embodiments according to the present invention, it is possible to determine the previously unknown functional significance of regulatory variants. Indeed, embodiments of the present invention can be used to discover new transcription factor interactions. For example, using the present invention, disease-associated variants can be connected to molecular pathophysiology, which can explain the function of non-coding SNPs.

It should be appreciated by those skilled in the art that the specific embodiments disclosed above may be readily utilized as a basis for modifying or designing other techniques for carrying out the same purposes of the present invention. It should also be appreciated by those skilled in the art that such modifications do not depart from the scope of the invention as set forth in the appended claims. 

1. A method of treating an individual based on genetic variants in transcription factor binding sites, comprising: obtaining genetic data from an individual by sequencing genetic material of the individual; identifying at least one disease in the individual using a computer system comprising a processor and a memory, wherein the at least one disease is associated with gene regulation, wherein the at least one disease is identified by: receiving information regarding genome-wide quantitative and genetic profiles, information regarding genetic variants associated with disease conditions, information regarding DNA motifs associated with transcription factor binding, and molecular profiles of disease pathologies and storing this information using a computer system comprising a processor and a memory, wherein the quantitative and genetic profiles describe transcription factor binding and gene expression measured across a multitude of individual human genomes; identifying a set of candidate variants using a computer system with a processor and a memory, wherein the set of candidate variants are identified by: mapping the genetic variants associated with disease conditions to the DNA motifs associated with transcription factor binding; and determining whether the genetic variants associated with disease conditions affect gene expression by identifying the effect on gene expression of the mapped genetic variants from the quantitative and genetic profiles; generating regulatory impact data for the set of candidate variants using a computer system comprising a processor and a memory, wherein the regulatory impact data indicates the clinical significance of the set of candidate variants, where the clinical significance identifies a disease associated with the effect of the set of candidate variants on gene expression based on the received molecular profiles of disease pathology; and identifying the presence of at least one candidate variant from the set of candidate variants in the obtained genetic sequence data of the individual using a computer system comprising a processor and a memory; and treating the individual for the at least one identified disease based on the clinical significance described by the regulatory impact data associated with the identified at least one candidate variant.
 2. The method of claim 1, wherein the quantitative and genetic profiles describe NFkB binding measured across a multitude of individual human genomes, and the DNA motifs associated with transcription factor binding are NFkB motifs.
 3. The method of claim 1, further comprising performing genome-wide expression profiling of the individual to confirm the clinical significance of the at least one candidate variant.
 4. The method of claim 3, wherein the genome-wide expression profiling includes an RNA sequencing analysis.
 5. The method of claim 1, further comprising: determining the impact of the at least one candidate disease variant on disease pathology by comparing the at least one candidate variant to the molecular profiles of disease pathologies using a computer system comprising a processor and a memory.
 6. The method of claim 1, wherein the disease is at least one of an autoimmune disease and an inflammatory disease.
 7. The method of claim 1, wherein the disease is at least one of Alzheimer's disease, rheumatoid arthritis, type 1 diabetes, systemic lupus erythematosus, asthma, alopecia areata, non-Hodgkin's lymphoma, malaria, drug-induced liver injury, myocardial infarction, and atherosclerosis.
 8. The method of claim 1, wherein the genetic material comprises at least one of immunoprecipitated chromatin, genomic DNA, and RNA. 